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Abstract. I show how Markov chain sampling with the Metropolis-Hastings algorithm can be 
modified so as to take bigger steps when the distribution being sampled from has the characteristic 
that its density can be quickly recomputed for a new point if this point differs from a previous point 
only with respect to a subset of "fast" variables. I show empirically that when using this method, 
the efficiency of sampling for the remaining "slow" variables can approach what would be possible 
using Metropolis updates based on the marginal distribution for the slow variables. 

1 Introduction 

Suppose we wish to sample from a distribution ir(x, y) oc exp(— E(x, y)), where E is a given "energy" 
function, by simulating a Markov chain with tt as its equilibrium distribution. Let's suppose that 
x is a "slow" variable and y is a "fast" variable, so that once E(x, y) has been computed (and 
intermediate quantities cached), we can compute E(x, y') much faster than we can compute E(x', y') 
for some x' for which we haven't previously calculated E. 

I was led to consider this problem because it arises with Bayesian models that attempt to infer 
cosmological parameters from data on the cosmic microwave background radiation (Lewis and 
Bridle 2002), for which recomputation after changing only fast variables can be around a thousand 
times faster than recomputation after changing a slow variable. Similarly large differences between 
fast and slow variables can arise with Gaussian process classification models (Neal 1999), in which 
updating the latent variables is fast, while updating the parameters of the covariance matrix is slow, 
since the new covariance matrix must then be inverted. Computationally equivalent problems also 
arise in geostatistics (Diggle, Tawn, and Moyeed 1998), and for what are called "generalized linear 
mixed effects models". Many other statistical problems also have some variables that are faster 
than others, though not always by such a large factor. 

Ideally, we would like to do Metropolis-Hastings updates for x (Metropolis, et al 1953; Hastings 
1970), using some proposal distribution, S(x*\x), and accepting or rejecting x* based on its marginal 
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distribution, ir(x). The acceptance probability for such a proposal would be 



a(x,x*) = min 1, 



S(x\x*) tt(x*) 
S(x*\x) 7r(x) 



(1) 



Suppose, however, that we can't feasibly compute the marginal distribution, tt(x), so that this 
approach is not possible. Instead we will have to use a Metropolis-Hastings algorithm that operates 
on the joint distribution for x and y. If we could sample directly from the conditional distribution 
for y, ir(y\x), we could generate x* from S(x*\x) and then y* from ir(y*\x*), and the resulting 
acceptance probability for (x*,y*) would be the same (due to cancellation) as that above using the 
marginal distribution for x. However, let's assume that sampling from n(y\x) is also infeasible. We 
might hope to approximate Tr(y\x) by some transition distribution T(y*\y;x) that we can sample 
from. To use this approximation in a Metropolis-Hasting proposal, however, we would need to be 
able to compute the probability of proposing y*, which will likely be impossible if we have to resort 
to iterative methods (eg, Markov chain simulation) in order to obtain a good approximation. 

This paper describes a way in which these problems can be bypassed when recomputing E(x, y) 
after changing only the "fast' variable y is much quicker than recomputing E(x, y) after changing 
x. In this method, changes to x are made in conjunction with changes to y that are found by 
"dragging" y with the help of intermediate transitions that involve only fast re-computations of 
E. In the limit as the number of such intermediate transitions increases, I show empirically (but 
haven't proved) that the method is equivalent to using the marginal distribution of x. Since the 
intermediate transitions involve only fast computations, we hope to be able to do quite a few 
intermediate transitions, and get close to the effect of using the marginal probabilities for x. 

The method can be seen as a generalization of "tempered transitions" (Neal 1996), and could 
be expressed in greater generality than I have done here, where I concentrate on the context with 
fast and slow variables. To begin, I'll describe the method when there is only one intermediate 
transition, since this is easier to work with, but I expect that one would use many intermediate 
transitions in practice, as described later. 

2 The method with one intermediate transition 

If the current state is (x,y), we start by generating a proposed new value x* according to the 
probabilities S(x*\x). We then define a distribution, p, over values for y that is intermediate 
between vr(y|x) and 7r(y|x*), as follows: 



Here, the dependence of p on x and x* has been made explicit, but note that this is a distribution 
over y only, not x and y jointly. We choose some transition probabilities, T, for updating y so as 
to leave p invariant. These probabilities must of course depend on x and x*. We write them as 
T(y'\y;x,x*). We require that they satisfy detailed balance, so that for all x, x*, y, and y', 



We also require that T depend symmetrically on the two x values — for all x, x*, y and y': 



p(y;x,x*) oc exp(-(£(x, y) + E(x*, y))/2) 



(2) 



p(y;x,x*)T(y'\y;x,x*) = p(y';x,x*)T(y\y';x,x*) 



(3) 



T(y'\y;x,x*) 



T(y'\y;x*,x) 



(4) 
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T might, for example, be a Metropolis-Hastings update, or a series of such updates. We apply this 
transition once, to sample a value y* from T(y*\y; x, x*). We then accept (x*,y*) as the next state 
with probability a(x, y, x* , y*), defined as follows: 



a{x,y,x ,y 



mm 



mm 



mm 



1, 



1, 



1, 



S(x\x*)ir(x* ,y*) p(y;x,x*) 
S(x*\x) tt(x, y) p(y*;x, x*) 

S(x\x*) exp(-£(x*,y*)) exp(-(£(x, y) + E(x*, y))/2) 
S(x*\x) exp(—E(x,y)) exp(—(E(x,y*) + E(x* ,y*))/2) 



S(x\x*] 
S{x*\x] 



exp 



E{x,y) +E(x,y* 



E(x*,y)+E(x*,y*] 



(5) 
(6) 
(7) 



If we don't accept, the next state is the current state, (x,y). 



Although this expression for a(x,y,x*,y*) has four occurrences of £"(-, •), only two slow evalua- 
tions are needed. In fact, only one slow evaluation is needed if we assume that an evaluation was 
done previously for the current state, when it was proposed. Note also that we would often choose 
a symmetric proposal distribution for x, so that S(x*\x)/ S(x\x*) = 1. 

To show that this is a valid update, I will prove that it satisfies detailed balance. The probability 
of moving from (x,y) to a different state (x*,y*) when in equilibrium is 

S(x\x*)ir(x*,y*) p(y;x,x*) 



ir(x, y) S(x* \x) T(y*\y; x, x*) min 



1, 



mm 



mm 



mm 



S(x*\x)ir{x,y)T(y*\y;x,x 



S(x*\x) tt(x, y) p(y*;x, x*) 

S(x\x*) 7r(x*,y*) p(y; x, x*)T(y*\y; x, x*) 



p(y*;x,x*) 

S{x*\x) vr(x, y) T(y* \y; x, x*), S(x\x*) vr(x*, y*) T(y\y*;x, x*) 
S(x*\x) vr(x, y) T(y*\y; x, x*), S(x\x*) ir(x*,y*) T(y\y*; x*,x) 



(8) 
(9) 
(10) 



Here, the detailed balance condition © and symmetry condition Q have been used. Examination 
of the above shows that swapping (x,y) and (x*,y*) leaves it unchanged, showing the detailed 
balance holds. 

I would expect this method to work better than the simple method of just proposing to change 
from x to x* while keeping y unchanged. The latter method will work well only if the old y is often 
suitable for the new x* — ie, if the old y is typical of 7r(y|x*). This will often be true only if the 
change from x to x* is small. The new method changes y to a y* that is drawn approximately (if T 
works well) from a distribution that is halfway between ir(y\x) and ir(y\x*). Such a y* should have 
a better chance of being suitable for x*, allowing the change from x to x* to be greater while still 
maintaining a good acceptance probability. If we propose an x* that is a really big change from x, 
however, even a y* that comes from a distribution halfway to 7r(y|x*) may not be good enough. 



3 The method with many intermediate transitions 

We can try to take bigger steps in x by "dragging" y through a series of intermediate distributions 
interpolating between ir(y\x) and ir(y\x*). Given some integer n > 1, we define the following 
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distributions, for i = 0, . . . , n: 

Pi(y;x,x*) (x exp(- ((l-i/n)E(x,y) + (i/n)E(x* ,y))) (11) 

Notice that po(y,x,x*) = ir(y\x) and p n (y;x,x*) = ir(y\x*). When n = 2, pi is the same as the p 
defined above in (J2J). Finally, note that pi(y;x,x*) = p n ^i(y; x*, x). 

For each pi, we need to choose transition probabilities, Ti, which may depend on x and x*. We 
require that they satisfy detailed balance, so that for all x, x*, y, and y', 

Pi(y;x,x*)Ti(y'\y;x,x*) = pi(y';x,x*)Ti(y\y';x,x*) (12) 

We also require of each opposite pair of transitions, T and T n -i, that for all x, x* , y and y', 

Ti(y'\y;x,x*) = T n -i(y'\y,x*,x) (13) 

These conditions will be satisfied if the Ti are standard Metropolis updates with respect to the pi, 
with Ti and T n _j using the same proposal distribution. 

The update procedure using n — 1 intermediate distributions is as follows. If the current state 
is x, we first propose a new x* according to the probabilities S(x*\x). We then generate a series 
of values y%, . . . , y n -i, with yi being drawn according to the probabilities Ti(yi\yi-i;x,x*). Let 
V* = Vn-ii an d define yo = y. We accept (x*,y*) as the new state of the Markov chain with the 
following probability: 



a(x,y,x*,y*,yi,...,y n - 2 ) = min 



1 S(x\x*)ir(x*,y*) jj p, i //,_ i : r. x ) 



mm 



S(x*\x)ir(x,y) p^y^x.x*) 

1 71—1 ^ 71—1 



Sfxlx*) 
F7 «i n ex P 



5(x*|x) 



7=0 



7 = 



(14) 



(15) 



To show that this is a valid update, I will show that the probability in equilibrium of the chain 
moving from (x,y) to a different state (x*,y*) while generating intermediate states y\, . . . ,y n ~2 is 
equal to the probability of the chain moving from (x*,y*) to (x,y) while generating intermediate 
states y n -2i ■■■ iVi- Detailed balance then follows by summing over possible sequences of interme- 
diate states. The probability of moving from (x, y) to (x*, y*) via y%, . . . , y n -2 can be written as 
"n-l 



7r(x, y) S(x*\x) 



Y[ T i(yi\yi-i;x,x*) 



a(x,y,x*,y*,yi, . . .,y n -2) 



mm 



71-1 



S(x*\x)ir(x,y) Tj (y* |yi_i ; x, x*), 

Pi(yi-i; x, x*) Ti{yi\yi-.\]x, x*) 



i=i 

n-l 



S(x\x*)n(x*,y*)U 



mm 



mm 



i=i 

71-1 



Pi(yi;x,x* 



n-l 



S(x*\x)ir(x,y) T i (y i \y i -. 1 ;x,x*), S(x\x*) vr(x*, y*) J| Ti(y. 



i=l 

71-1 



i=l 
n-l 



S(x*|x) 7r(x, y) iK^I^-i; x, x*), S(x|x*) 7r(x*, y*) T n _i(yj_i j^; x*, : 
i=i i=i 



(16) 
(17) 
(18) 
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If we swap x and x*, y and y*, and yi and y n _j_i, reverse the order of the two products, and swap 
the arguments of min, we see that this expression is unchanged, showing that the reverse transition 
from (x*, y*) to (x, y) via y n -2, ■ ■ ■ , Hi is equally likely. 



4 Tests on simple distributions 

I first tested the dragging method on a simple distribution in which x and y are both one- 
dimensional, with 7r(x,y) defined by the following energy function: 

E(x,y) = x 2 + 50 (1 + x 2 ) 2 (y - sin(x)) 2 (19) 

Examination of this shows that the conditional distribution for y given x is Gaussian with mean 
sin(x) and standard deviation 0.1/(l+x 2 ). From this, one can deduce that the marginal distribution 
for x can be obtained with an energy function of x 2 + log(l + x 2 ). For this test problem, we can 
therefore compare performance using dragging transitions to the "ideal" performance when doing 
Metropolis updates based on this marginal distribution. Figure^shows a sample of points obtained 
in this way, with y values filled in randomly from their conditional distribution given x. 

For purposes of this test, we can pretend that computing sin(x) is much slower than any of the 
other computations involved in evaluating E(x, y), or in the mechanics of performing Markov chain 
updates. This will make x a "slow" variable, whereas y will be a "fast" variable. We also pretend 
that we don't know that x and y are positively correlated. This mimics situations in which we are 
first exploring the distribution, or in which the relationship between x and y is non-monotonic, so 
that no linear transformation is helpful. 

Figure[2shows the efficiency of six sampling methods applied to this distribution, as measured by 
the autocorrelations for x at lags up to 30. All the methods are based on the Metropolis algorithm 
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with Gaussian proposals centred on the current state. In all cases, the standard deviation of the 
Gaussian proposals was adjusted to be approximately optimal. All the methods require only one 
slow computation of sin(x) for each iteration (for the Marginal Metropolis method, this would be 
needed only when filling in y values to go with the x values). 

In the Joint Metropolis method, the proposals change x and y simultaneously and independently, 
with the standard deviations for each being 0.5. The rejection rate for these proposals was 87%. 
In the Single-variable Metropolis method, two Metropolis updates are done each iteration, one for 
x only, the other for y only. The standard deviations for these proposals were both 0.25. The 
rejection rates were 59% for x and 64% for y. For the Marginal Metropolis method, where the state 
consists of x alone, the proposals had standard deviation of 1.0, and the rejection rate was 47%. 
Clearly, the Marginal Metropolis method performs much better than the other two, though in real 
problems it would typically be infeasible. 

The remaining plots show the autocorrelations when sampling using updates that drag y while 
changing x, with 20, 100, and 500 intermediate distributions. For all three plots, the proposal 
distribution for x had standard deviation 1.0, while the proposal distributions for y during the 
intermediate transitions had standard deviation 0.2. The rejection rate for the "inner" updates 
of y was around 60% for all three runs. The rejection rates for the "outer" updates of x were 
76%, 63%, and 52% for 20, 100, and 500 intermediate distributions. Both the rejection rate and 
the autocorrelations seem to be approaching the "ideal" values seen with the Marginal Metropolis 
method. Provided that recomputing E(x,y) after changing y is around a thousand times faster 
than recomputing it after changing x, updates for x using dragging transitions will be almost as 
good as updates based on the marginal distribution of x. 

To see how sensitive these results are to the dimensionality of the fast parameter, I did a second 
test introducing another fast parameter, z. The energy function used was 

E(x,y) = x 2 + 50(l + x 2 ) 2 (y-sin(x)) 2 + 12.5 (z - yf (20) 

This produces marginal distributions for (x, y) and for x that are the same as for the first test. 

Figure El shows the efficiency of the six sampling methods applied to this distribution. The same 
proposal standard deviations were used as in the first test, except that for the Joint Metropolis 
updates, the standard deviations were 0.3, producing a rejection rate of 85%. The dragging transi- 
tions were done using Joint Metropolis updates for y and z as the inner transitions, with proposal 
standard deviations of 0.2. 

As can be seen, all methods perform less well with the extra variable, except for the Marginal 
Metropolis method, which is the same as in the first test. The dragging transitions are less affected, 
however. The autocorrelation time (one plus twice the sum of the autocorrelations at all lags) when 
using 500 intermediate distributions increased from approximately 7.4 to approximately 9.3 with 
the addition of z. In contrast, the autocorrelation time for the Joint Metropolis updates increased 
from approximately 75 to approximately 205, and that for the Single-variable Metropolis updates 
went from approximately 230 to approximately 365. 

The programs (written in R) used for these tests are available from my web page. 
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Figure 2: Estimated autocorrelations for x at lags up to 30 when sampling for the first test problem 
using six methods. 
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Figure 3: Estimated autocorrelations for x at lags up to 30 when sampling for the second test 
problem using six methods. 
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